Objective

  • Disclaimer: non-comprehensive introduction to RNA-sequencing
  • Introduce preprocessing steps
  • Visualization
  • Analytical methods
  • Common software tools

Steps to an RNA-seq Analysis (Literacy)

  1. Preprocessing and QC:
    • Fasta and Fastq files
    • FastQC: good vs. bad examples
    • Visualization
  2. Alignment
    • Obtaining genome sequence and annotation
    • Software: Bowtie, TopHat, STAR, Subread/Rsubread
  3. Expression Quantification
    • Count reads hitting genes, etc
    • Approaches/software: HT-Seq, STAR, Cufflinks, RPKM FPKM or CPM, RSEM, edgeR, findOverlaps (GenomicRanges). featureCounts (Rsubread)
  4. More visualization
    • Heatmaps, boxplots, PCA, t-SNE, UMAP
  5. Differential Expression
    • Batch correction
    • Overdispersion
    • General Workflow
    • Available tools: edgeR, DESeq, Limma/voom
    • Even more visualization!!

Illumina Sequencing Workflow

Figure 1: Illumina Sequencing Workflow

Sequencing Data Formats

Genome sequcencing data is often stored in one of two formats, FASTA and FASTQ text files. For example a FASTA files looks like the following:

FASTA Files

Figure 2: FASTA file format

FASTQ Files

We can also store confidence or quality scores using a FASTQ format: Figure 3: FASTQ file format

FASTQ Encoding

In order to translate FASTQ quality scores:

Figure 4: Fastq Encoding

FASTQ Probability

And now converting to confidence probabilities:

Figure 5: Fastq encoding quality scores

Preprocessing and QC using FASTQC

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do QC checks on raw sequence data:

  • Import of data from BAM, SAM or FastQ files
  • Quick overview and summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application

FastQC Scores

Figure 5: Fastqc scores

FastQC Score Distribution

Figure 6: Fastqc score distribution

FastQC Base Distribution

Figure 7: Fastqc base distribution

FastQC N Distribution

Figure 8: Fastqc N distribution

Alignment to the Reference Genome

Find the genomic Location of origin for the sequencing read. Software: Bowtie2, TopHat, STAR, Subread/Rsubread, many others!

Figure 9: Sequence read alignment

Here is quick tutorial on sequnce aligment.

Using Rsubread to do Alignment

The following userguide will be helpful for you:

http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf

Indexing your genome

Abraham Lincoln: “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” (4 minutes indexing the genome, 2 minutes aligning the reads)

Note that you will rarely do this for human alignment. You will usually download an existing index given to you by others who have already done this work. You will do this often if you are aligning microbial reads, e.g. MTB or some other organism for which others have not already made your index for you.

buildindex(basename="genome/ucsc.hg19.chr1_120-150M",reference="genome/ucsc.hg19.chr1_120-150M.fasta.gz")
## WARNING: your system seems to be 32-bit. Rsubread supports 32-bit systems to a very limited level.
## It is highly recommended to run Rsubread on a 64-bit system to avoid errors.
## 
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.12.3
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## ||                Index name : ucsc.hg19.chr1_120-150M                        ||
## ||               Index space : base space                                     ||
## ||               Index split : no-split                                       ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||              Gapped index : no                                             ||
## ||                                                                            ||
## ||       Free / total memory : 11.1GB / 24.0GB                                ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o ucsc.hg19.chr1_120-150M.fasta.gz             ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || There were 4800000 notes for reference sequences.                          ||
## || The notes can be found in the log file, 'genome/ucsc.hg19.chr1_120-150 ... ||
## || Scan uninformative subreads in reference sequences ...                     ||
## || 516 uninformative subreads were found.                                     ||
## || These subreads were excluded from index building.                          ||
## || Estimate the index size...                                                 ||
## ||    8%,   0 mins elapsed, rate=18922.0k bps/s                               ||
## ||   16%,   0 mins elapsed, rate=26103.2k bps/s                               ||
## ||   24%,   0 mins elapsed, rate=29874.7k bps/s                               ||
## ||   33%,   0 mins elapsed, rate=32146.7k bps/s                               ||
## ||   41%,   0 mins elapsed, rate=33710.9k bps/s                               ||
## ||   49%,   0 mins elapsed, rate=34838.4k bps/s                               ||
## ||   58%,   0 mins elapsed, rate=35686.4k bps/s                               ||
## ||   66%,   0 mins elapsed, rate=36346.2k bps/s                               ||
## ||   74%,   0 mins elapsed, rate=36882.6k bps/s                               ||
## ||   83%,   0 mins elapsed, rate=33709.4k bps/s                               ||
## ||   91%,   0 mins elapsed, rate=30610.5k bps/s                               ||
## || 3.0 GB of memory is needed for index building.                             ||
## || Build the index...                                                         ||
## ||    8%,   0 mins elapsed, rate=3896.2k bps/s                                ||
## ||   16%,   0 mins elapsed, rate=6322.5k bps/s                                ||
## ||   24%,   0 mins elapsed, rate=7978.8k bps/s                                ||
## ||   33%,   0 mins elapsed, rate=9178.5k bps/s                                ||
## ||   41%,   0 mins elapsed, rate=10088.6k bps/s                               ||
## ||   49%,   0 mins elapsed, rate=10804.1k bps/s                               ||
## ||   58%,   0 mins elapsed, rate=11376.0k bps/s                               ||
## ||   66%,   0 mins elapsed, rate=11850.6k bps/s                               ||
## ||   74%,   0 mins elapsed, rate=12247.5k bps/s                               ||
## ||   83%,   0 mins elapsed, rate=11620.6k bps/s                               ||
## ||   91%,   0 mins elapsed, rate=10905.1k bps/s                               ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.2 minutes.                      ||
## ||        Index genome/ucsc.hg19.chr1_120-150M was successfully built.        ||
## ||                                                                            ||
## \\============================================================================//

Took me ~0.2 minutes

Aligning your reads:

Note that this outputs results in a .bam file and not a .sam file

align(index="genome/ucsc.hg19.chr1_120-150M",readfile1="reads/R01_10_short500K.fq.gz",output_file="alignments/R01_10_short.bam", nthreads=4) 
## WARNING: your system seems to be 32-bit. Rsubread supports 32-bit systems to a very limited level.
## It is highly recommended to run Rsubread on a 64-bit system to avoid errors.
## 
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.12.3
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (RNA-Seq)                                   ||
## || Input file    : R01_10_short500K.fq.gz                                     ||
## || Output file   : R01_10_short.bam (BAM)                                     ||
## || Index name    : ucsc.hg19.chr1_120-150M                                    ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 4                                  ||
## ||                          Phred offset : 33                                 ||
## ||                             Min votes : 3 / 10                             ||
## ||                        Max mismatches : 3                                  ||
## ||                      Max indel length : 5                                  ||
## ||            Report multi-mapping reads : yes                                ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================ Running (20-Jun-2023 02:32:35, pid=26105) =================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [2,37]                   ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||    5% completed, 1.8 mins elapsed, rate=229.7k reads per second            ||
## ||   11% completed, 1.8 mins elapsed, rate=250.9k reads per second            ||
## ||   18% completed, 1.8 mins elapsed, rate=258.1k reads per second            ||
## ||   25% completed, 1.8 mins elapsed, rate=262.2k reads per second            ||
## ||   31% completed, 1.8 mins elapsed, rate=264.6k reads per second            ||
## ||   38% completed, 1.8 mins elapsed, rate=266.4k reads per second            ||
## ||   45% completed, 1.8 mins elapsed, rate=267.3k reads per second            ||
## ||   51% completed, 1.8 mins elapsed, rate=268.3k reads per second            ||
## ||   58% completed, 1.8 mins elapsed, rate=268.8k reads per second            ||
## ||   65% completed, 1.8 mins elapsed, rate=269.6k reads per second            ||
## ||   69% completed, 1.8 mins elapsed, rate=3.2k reads per second              ||
## ||   73% completed, 1.8 mins elapsed, rate=3.3k reads per second              ||
## ||   76% completed, 1.8 mins elapsed, rate=3.5k reads per second              ||
## ||   79% completed, 1.8 mins elapsed, rate=3.7k reads per second              ||
## ||   83% completed, 1.8 mins elapsed, rate=3.8k reads per second              ||
## ||   86% completed, 1.8 mins elapsed, rate=4.0k reads per second              ||
## ||   89% completed, 1.8 mins elapsed, rate=4.1k reads per second              ||
## ||   93% completed, 1.8 mins elapsed, rate=4.2k reads per second              ||
## ||   96% completed, 1.8 mins elapsed, rate=4.4k reads per second              ||
## ||   99% completed, 1.8 mins elapsed, rate=4.5k reads per second              ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||                 Total reads : 503,568                                      ||
## ||                      Mapped : 376,964 (74.9%)                              ||
## ||             Uniquely mapped : 317,555                                      ||
## ||               Multi-mapping : 59,409                                       ||
## ||                                                                            ||
## ||                    Unmapped : 126,604                                      ||
## ||                                                                            ||
## ||                      Indels : 2,236                                        ||
## ||                                                                            ||
## ||                Running time : 1.8 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//
##                       R01_10_short.bam
## Total_reads                     503568
## Mapped_reads                    376964
## Uniquely_mapped_reads           317555
## Multi_mapping_reads              59409
## Unmapped_reads                  126604
## Indels                            2236

My laptop is an Apple M2, which has 8 cores (used 4 cores), 24GB RAM

Took 15.7 minutes to align ~60M reads to the 30M bases

Took 0.7 minutes to align ~6.5M reads to the 30M bases

Took 0.3 minutes to align ~500K reads to the 30M bases

Aligned Sequencing Data Formats (SAM and BAM)

Note that Rsubread outputs a .bam file (bam = binary alignment map) and not a .sam file (sam = sequence alignment map). Here is some information about a .sam file:

https://en.wikipedia.org/wiki/SAM_(file_format)

Figure 10: SAM and BAM file format

https://samtools.github.io/hts-specs/SAMv1.pdf

To convert .sam to .bam or vice versa, a package called Rsamtools. Using Rsamtools, you can convert bam to sam as follows:

asSam("alignments/R01_10_short.bam", overwrite=T) 
## [1] "alignments/R01_10_short.sam"
# To convert to bam:
#asBam("alignments/R01_10_short.bam") 

Makes a system call to the Mac terminal to generate a .sam file

Feature counts

Now we can count reads hitting genes. Approaches/software:

  • HT-Seq
  • STAR
  • Cufflinks
  • RPKM FPKM or CPM
  • RSEM
  • edgeR
  • findOverlaps (GenomicRanges)
  • featureCounts (Rsubread)

Figure 11: Feature Counts

fCountsList = featureCounts("alignments/R01_10_short.bam", annot.ext="genome/genes.chr1_120-150M.gtf", isGTFAnnotationFile=TRUE)
## Warning: your system seems to be 32-bit. Rsubread supports 32-bit systems to a limited level only.
## We recommend that Rsubread be run on 64-bit systems to avoid any possible problems.
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.12.3
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 1 BAM file                                       ||
## ||                                                                            ||
## ||                           R01_10_short.bam                                 ||
## ||                                                                            ||
## ||              Paired-end : no                                               ||
## ||        Count read pairs : no                                               ||
## ||              Annotation : genes.chr1_120-150M.gtf (GTF)                    ||
## ||      Dir for temp files : .                                                ||
## ||                 Threads : 1                                                ||
## ||                   Level : meta-feature level                               ||
## ||      Multimapping reads : counted                                          ||
## || Multi-overlapping reads : not counted                                      ||
## ||   Min overlapping bases : 1                                                ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file genes.chr1_120-150M.gtf ...                           ||
## ||    Features : 1918                                                         ||
## ||    Meta-features : 104                                                     ||
## ||    Chromosomes/contigs : 1                                                 ||
## ||                                                                            ||
## || Process BAM file R01_10_short.bam...                                       ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 503568                                               ||
## ||    Successfully assigned alignments : 4561 (0.9%)                          ||
## ||    Running time : 0.00 minutes                                             ||
## ||                                                                            ||
## || Write the final count table.                                               ||
## || Write the read assignment summary.                                         ||
## ||                                                                            ||
## \\============================================================================//
featureCounts = cbind(fCountsList$annotation[,1], fCountsList$counts)

write.table(featureCounts, "alignments/R01_10_short.features.txt", sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)

SCTK

Use the Single Cell Toolkit to analyze your RNA-seq data!

#install.packages("devtools")
#devtools::install_github("wevanjohnson/singleCellTK")
library(singleCellTK)
singleCellTK()

### open features_combined.txt
### and meta_data.txt